
[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections.#33892

Merged
tjtanaa merged 91 commits into vllm-project:main from EmbeddedLLM:3n-block-scaled-rfc-pr
Apr 9, 2026

Conversation

@maralbahari
Contributor

@maralbahari maralbahari commented Feb 5, 2026

Purpose

This PR refactors the block-scaled linear kernels into a kernel abstraction.

Changes:

  • Introduces the MMLinearKernel base interface for all linear kernels.
  • Introduces the Params, Fp8Params, and Int8Params classes for accessing layer parameters in a structured format.
  • Introduces DynamicMMLinearKernel, an MMLinearKernel whose two main properties are a base and a fallback kernel (both MMLinearKernel variants); this class switches between the base and fallback implementations at runtime.
  • Removes the legacy W8A8BlockFp8LinearOp class.
  • Unifies kernel selection for both block and non-block quantization.
  • Updates all consumers (fp8.py, modelopt.py, tests, benchmarks).
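
The interfaces listed above could be sketched roughly as follows. This is a schematic, not the actual vLLM implementation: the class names (MMLinearKernel, Params, Fp8Params, DynamicMMLinearKernel) come from this PR, but the method signatures, the Tensor stand-in, and the dispatch heuristic are illustrative assumptions.

```python
# Schematic sketch of the kernel abstraction described in this PR.
# Class names follow the PR; bodies and signatures are illustrative.
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

Tensor = Any  # stand-in for torch.Tensor in this sketch


@dataclass
class Params:
    """Structured access to a layer's quantized parameters."""
    weight: Tensor
    weight_scale: Tensor


@dataclass
class Fp8Params(Params):
    input_scale: Tensor | None = None


class MMLinearKernel(ABC):
    """Base interface for all matmul linear kernels."""

    @classmethod
    @abstractmethod
    def is_supported(cls) -> tuple[bool, str | None]:
        """Return (supported, reason_if_not) for the current platform."""

    @abstractmethod
    def apply(self, params: Params, x: Tensor) -> Tensor:
        ...


class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel (both MMLinearKernel variants)
    and switches between them at runtime."""

    def __init__(self, base: MMLinearKernel, fallback: MMLinearKernel):
        self.base = base
        self.fallback = fallback

    @classmethod
    def is_supported(cls) -> tuple[bool, str | None]:
        return True, None

    def apply(self, params: Params, x: Tensor) -> Tensor:
        kernel = self.base if self._use_base(x) else self.fallback
        return kernel.apply(params, x)

    def _use_base(self, x: Tensor) -> bool:
        # Placeholder heuristic; the real switch condition is
        # kernel-specific (e.g. batch size or shape constraints).
        return len(x) >= 32
```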

Test Plan

CUDA platform:
Run CI/CD tests.

ROCm platform:
lm_eval score RedHatAI/Qwen3-30B-A3B-FP8-block

Test Result

ROCm platform:
lm_eval score RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.8196 | ± 0.0106 |
|       |         | strict-match     | 5      | exact_match | 0.8954 | ± 0.0084 |

W8A8 Block Linear Refactor PRs:


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot added performance Performance-related issues nvidia labels Feb 5, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic W8A8BlockFp8LinearOp and introducing a new kernel abstraction layer with MMLinearKernel, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in init_fp8_linear_kernel is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.

I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.
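
The dispatch flow the review praises could look roughly like this. It is a hypothetical sketch: the real init_fp8_linear_kernel lives in vLLM's FP8 quantization utilities, and its signature, candidate ordering, and error handling may differ; the helper name select_linear_kernel and the is_supported protocol are assumptions.

```python
# Hypothetical sketch of first-supported-kernel selection, as described
# in the review; not the actual init_fp8_linear_kernel implementation.
def select_linear_kernel(candidates: list[type]) -> object:
    """Return an instance of the first candidate kernel class whose
    is_supported() reports True on the current platform; otherwise
    raise with the collected rejection reasons."""
    failures = []
    for kernel_cls in candidates:
        supported, reason = kernel_cls.is_supported()
        if supported:
            return kernel_cls()
        failures.append(f"{kernel_cls.__name__}: {reason}")
    raise ValueError("no supported linear kernel: " + "; ".join(failures))
```

Collecting every rejection reason before raising makes misconfigured platforms much easier to debug than a bare "no kernel found" error.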

Signed-off-by: maral <maralbahari.98@gmail.com>
maralbahari and others added 8 commits February 5, 2026 18:04
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
…r.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
…kScaledMMLinearKernel.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr

Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify
Contributor

mergify bot commented Feb 6, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @maralbahari.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 6, 2026
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
@mergify mergify bot removed the needs-rebase label Feb 9, 2026
Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr

Signed-off-by: maral <maralbahari.98@gmail.com>
@maralbahari maralbahari marked this pull request as ready for review February 23, 2026 02:09
@mergify
Contributor

mergify bot commented Apr 6, 2026

Hi @maralbahari, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
Collaborator

@tjtanaa tjtanaa left a comment


LGTM

@tjtanaa tjtanaa merged commit 2e9034c into vllm-project:main Apr 9, 2026
140 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in NVIDIA Apr 9, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Apr 9, 2026
jdebache pushed a commit to jdebache/vllm that referenced this pull request Apr 9, 2026
…pt Fp8 block linear kernel selections. (vllm-project#33892)

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Elm8116 pushed a commit to Elm8116/vllm that referenced this pull request Apr 9, 2026
…pt Fp8 block linear kernel selections. (vllm-project#33892)

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Apr 9, 2026
…pt Fp8 block linear kernel selections. (vllm-project#33892)

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…pt Fp8 block linear kernel selections. (vllm-project#33892)

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
jackcfwang pushed a commit to jackcfwang/vllm that referenced this pull request Apr 10, 2026
…pt Fp8 block linear kernel selections. (vllm-project#33892)

Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: Maral <maralbahari.98@gmail.com>
Signed-off-by: jackcfwang <jackcfwang@tencent.com>
Natfii added a commit to Navi-AI-Lab/nvllm that referenced this pull request Apr 10, 2026
Required by NvFp4LinearKernel refactor (vllm-project#39129). Copied from upstream/main
rather than cherry-picking the full W8A8 block linear refactor (vllm-project#33892, 35 files).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii added a commit to Navi-AI-Lab/nvllm that referenced this pull request Apr 10, 2026
…selections (vllm-project#33892)

Cherry-picked from upstream vllm-project/vllm@2e9034c99.
Required dependency for NvFp4LinearKernel refactor (vllm-project#39129) — provides
base.py, block-scaled kernel classes, and updated FP8 utils.
Also synced nvfp4_emulation_utils.py for kE2M1ToFloat_handle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Natfii added a commit to Navi-AI-Lab/nvllm that referenced this pull request Apr 10, 2026
Previous cherry-pick of vllm-project#33892 overwrote NVFP4 exports from vllm-project#39129.
Synced to upstream/main which has both FP8 block and NVFP4 kernel exports.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Apr 13, 2026
…stream regressions in attention, FP8, offloading and platform (#1338)

## Summary

Fixes five regressions introduced by recent upstream vLLM changes that
break HPU unit tests and model execution.

## Changes

1. **Remove `use_output` guard from HPU attention patch** — attribute
removed upstream
2. **Remove `accept_output_buffer` branching from HPU MLA attention** —
attribute removed upstream; unconditionally use output buffer in opaque
path, direct call path manages output internally
3. **Update KV offloading connector tests** — field renames:
`block_hashes` → `keys`, `block_hashes_to_store` → `keys_to_store`,
config access via `kv_group_configs[0]`
4. **Register HPU FP8 block-scaled kernel + add ops test conftest** —
new `_POSSIBLE_FP8_BLOCK_KERNELS` dict needs OOT entry; provide
`VllmConfig` stub for ops unit tests
5. **Add `manual_seed_all` to `HpuPlatform`** — new required platform
method for RNG seeding

## Upstream PRs that introduced these regressions

- vllm-project/vllm#39125 — removed
`accept_output_buffer` and `use_output` from attention layer (fixes 1,
2)
- vllm-project/vllm#37109 — restructured
`OffloadingConnectorScheduler` API (fix 3)
- vllm-project/vllm#33892 — added
`model_config.dtype` access in `Fp8LinearMethod.__init__` and
`_POSSIBLE_FP8_BLOCK_KERNELS` (fix 4)
- vllm-project/vllm#38468 — added
`manual_seed_all` as required abstract method on `Platform` (fix 5)

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Labels

nvidia
performance: Performance-related issues
ready: ONLY add when PR is ready to merge / full CI is needed
ready-run-all-tests: Trigger CI with all tests for wide-ranging PRs
rocm: Related to AMD ROCm

Projects

NVIDIA: Done
AMD: Done

Development

Successfully merging this pull request may close these issues.

4 participants